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With different genomes available, unsupervised learning algorithms are es¬ 
sential in learning genome-wide biological insights. Especially, the functional 
characterization of different genomes is essential for us to understand lives. 
In this book chapter, we review the state-of-the-art unsupervised learning 
algorithms for genome informatics from DNA to MicroRNA. 

DNA (DeoxyriboNucleic Acid) is the basic component of genomes. A sig¬ 
nificant fraction of DNA regions (transcription factor binding sites) are bound 
by proteins (transcription factors) to regulate gene expression at different de¬ 
velopment stages in different tissues. To fully understand genetics, it is neces¬ 
sary of us to apply unsupervised learning algorithms to learn and infer those 
DNA regions. Here we review several unsupervised learning methods for de¬ 
ciphering the genome-wide patterns of those DNA regions. 

MicroRNA (miRNA), a class of small endogenous non-coding RNA (Ri- 
boNucleic acid) species, regulate gene expression post-transcriptionally by 
forming imperfect base-pair with the target sites primarily at the 3' untrans¬ 
lated regions of the messenger RNAs. Since the 1993 discovery of the first 
miRNA let-7 in worms, a vast amount of studies have been dedicated to func¬ 
tionally characterizing the functional impacts of miRNA in a network context 
to understand complex diseases such as cancer. Here we review several rep¬ 
resentative unsupervised learning frameworks on inferring miRNA regulatory 
network by exploiting the static sequence-based information pertinent to the 
prior knowledge of miRNA targeting and the dynamic information of miRNA 
activities implicated by the recently available large data compendia, which in¬ 
terrogate genome-wide expression profiles of miRNAs and/or mRNAs across 
various cell conditions. 


1 Introduction 

Since the 1990s, the whole genomes of a large number of species have been 
sequenced by their corresponding genome sequencing projects. In 1995, the 



2 


Ka-Chun Wong Yue Li Zhaolei Zhang 


first free-living organism Haemophilus influenzae was sequenced by the Insti¬ 
tute for Genomic Research [35] . In 1996, the first eukaryotic genome ( Saccha- 
romyces cerevisiase) was completely sequenced [ 551 . In 2000, the first plant 
genome, Arabidopsis thaliana , was also sequenced by Arabidopsis Genome 
Initiative E2- In 2004, the Human Genome Project (HGP) announced its 
completion [32]. Following the HGP, the Encyclopedia of DNA Elements (EN¬ 
CODE) project was started, revealing massive functional putative elements 
on the human genome in 2011 [35] • The drastically decreasing cost of se¬ 
quencing enables the 1000 Genomes Project to be carried out, resulting in an 
integrated map of genetic variation from 1,092 human genomes published in 
2012 pQ. The massive genomic data generated by those projects impose an 
unforeseen challenge for large-scale data analysis at the scale of gigabytes or 
even terabytes. 

Computational methods are essential in analyzing the massive genomic 
data. They are collectively known as bioinformatics or computational biol¬ 
ogy; for instance, motif discovery [53] helps us distinguish real signal subse¬ 
quence patterns from background sequences. Multiple sequence alignment [4] 
can be used to study the similarity between multiple sequences. Protein struc¬ 
ture prediction [115 .. 1168 ] can be applied to predict the 3D tertiary structure 
from an amino acid sequence. Gene network inference [35l are the statistical 
methods to infer gene networks from correlated data (e.g. microarray data). 
Promoter prediction [2] help us annotate the promoter regions on a genome. 
Phylogenetic tree inference [131] can be used to study the hierarchical evo¬ 
lution relationship between different species. Drug scheduling [1011 171 can 
help solve the clinical scheduling problems in an effective manner. Although 
the precision of those computational methods is usually lower than the ex¬ 
isting wet-lab technology, they can still serve as useful preprocessing tools 
to significantly narrow search spaces. Thus prioritized candidates can be se¬ 
lected for further validation by wet-lab experiments, saving time and funding. 
In particular, unsupervised learning methods are essential in analyzing the 
massive genomic data where the ground truth is limited for model training. 
Therefore, we describe and review several unsupervised learning methods for 
genome informatics in this chapter. 


2 Unsupervised Learning for DNA 

In human and other eukaryotes, gene expression is primarily regulated by 
the DNA binding of various modulatory transcription factors (TF) onto cis- 
regulatory DNA elements near genes. Binding of different combinations of 
TFs may result in a gene being expressed in different tissues or at different 
developmental stages. To fully understand a gene’s function, it is essential to 
identify the TFs that regulate the gene and the corresponding TF binding 
sites (TFBS). Traditionally, these regulatory sites were determined by labor- 
intensive experiments such as DNA footprinting or gel-shift assays. Various 
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computational approaches have been developed to predict TF binding sites in 
silico. Detailed comparisons can be found in the survey by Tompa et al. [ 1156 ]. 
TFBS are relatively short (10-20 bp) and highly degenerate sequence motifs, 
which makes their effective identification a computationally challenging task. 
A number of high-throughput experimental technologies were developed re¬ 
cently to determine protein-DNA binding such as protein binding microarray 
(PBM) |16j . chromatin immunoprecipitation (ChIP) followed by microarray 
or sequencing (ChIP-Chip or ChIP-Seq) |12Qlf5TI . microfluidic affinity analysis 
[48] . and protein microarray assays EMU] ■ 

On the other hand, it is expensive and laborious to experimentally iden¬ 
tify TF-TFBS sequence pairs, for example, using DNA footprinting [53], gel 
electrophoresis [55], and SELEX [ 158] . The technology of Chromatin immuno¬ 
precipitation (ChIP) 112!), [5Tj measures the binding of a particular TF to the 
nucleotide sequences of co-regulated genes on a genome-wide scale in vivo , 
but at low resolution. Further processing are needed to extract precise TF- 
BSs [104 ]. Protein Binding Microarray (PBM) was developed to measure the 
binding preference of a protein to a complete set of k-mers in vitro [16] . The 
PBM data resolution is unprecedentedly high, comparing with the other tra¬ 
ditional techniques. The DNA k-mer binding specificities of proteins can even 
be determined in a single day. It has also been shown to be largely consistent 
with those generated by in vivo genome-wide location analysis (ChIP-chip 
and ChIP-seq) [16] . 

To store and organize the precious data, databases have been created. 
TRANSFAC is one of the largest databases for regulatory elements includ¬ 
ing TFs, TFBSs, weight matrices of the TFBSs, and regulated genes [lOD J. 
JASPAR is a comprehensive collection of TF DNA-binding preferences IllSl- 
Other annotation databases are also available (e.g. Pfam [L4j, UniProbe urn 
ScerTF [M8], FlyTF [123], YeTFaSCo [22], and TFcat [52]). Notably, with 
the open-source and open-access atmosphere wide-spreads on the Internet in 
recent years, a database called ORegAnno appeared in 2008 [1131 . It is an 
open-access community-driven database and literature curation system for 
regulatory annotation. The ENCODE consortium has also released a consid¬ 
erable amount of ChIP-Seq data for different DNA-binding proteins [42]. 

In contrast, unfortunately, it is still difficult and time-consuming to extract 
the high-resolution 3D protein-DNA (e.g. TF-TFBS ) complex structures with 
X-Ray Crystallography ]146] or Nuclear Magnetic Resonance (NMR) spectro¬ 
scopic analysis m ■ As a result, there is strong motivation to have unsuper¬ 
vised learningl methods based on existing abundant sequence data, to pro¬ 
vide testable candidates with high confidence to guide and accelerate wet-lab 
experiments. Thus unsupervised learning methods are proposed to provide 
insights into the DNA binding specificities of transcription factors from the 
existing abundant sequence data. 
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2.1 DNA Motif Discovery and Search 

Transcription Factor Binding Sites (TFBSs) are represented in DNA motif 
models to capture its sequence degeneracy am They are described in the 
following sections. 


Representation (DNA Motif Model) 

There are several motif models proposed. For example, consensus string rep¬ 
resentation, a set of motif instance strings, count matrix, position frequency 
matrix (PFM), and position weight matrix (PWM). Among them, the most 
popular motif models are the matrix ones. They are the count matrix, PFM, 
and PWM. In particular, the most common motif model is the zero-order 
PWM which has been shown to be related to the average protein-DNA binding 
energy in the experimental and statistical mechanics study [15]. Nonetheless, it 
assumes independence between different motif positions. A recent attempt has 
been made to generalize PWM but the indel operations between different nu¬ 
cleotide positions are still challenging }150j . Although the column dependence 
and indel operations could be modeled by Hidden Markov Model (HMM) si¬ 
multaneously, the number of training parameters is increased quadratically. 
There is a dilemma between accuracy and model complexity. 

Count Matrix 


The count matrix representation is the de facto standard adopted in databases. 
In the count matrix representation, a DNA motif of width w is represented as 
a 4-by-w; matrix C. The jth column of C corresponds to the jth position of 
the motif, whereas the *th row of C correspond to the ith biological character. 
In the context of DNA sequence, we have 4 characters {A,C,G,T}. CV, is the 
occurring frequency of the i biological character at the jth position. For ex¬ 
ample, the count matrix C sox 9 of the SOX9 protein (JASPAR ID:MA0077.1 
and UniProt ID:P48436) is tabulated in the following matrix form. The motif 
width is 9 so we have a 4 x 9 matrix here. The corresponding sequence logo 
is also depicted in Figure [TJ 


Csox 9 — 



1 

2 

3 

4 

5 

6 

7 

8 

9 

A 

/ 24 

54 

59 

0 

65 

71 

4 

24 

9 \ 

C 

7 

6 

4 

72 

4 

2 

0 

6 

9 

G 

31 

7 

0 

2 

0 

1 

1 

38 

55 

T 

\ 14 

9 

13 

2 

7 

2 

71 

8 

3 / 


Notably, adenine has not been found at the 4th position, resulting in a 
zero value. It leads to an interesting scenario. Is adenine really not found at 
that position or the sample size (the number of binding sites we have found 
so far for SOX9) is too small that the sites having adenine at that position 
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Fig. 1 . DNA Motif Sequence Logo for SOX9 (JASPAR ID:MA0077.1 and UniProt 
ID:P48436). The vertical axis measures the information content, while the horizontal 
axis denotes the positions. The consensus string is DAACAATRG, following the 
standard IUPAC nucleotide code. 


have not been verified experimentally yet ? To circumvent the problem, people 
usually add pseudo counts to the matrix which is justified from the use of prior 
probability in statistics smug. Such techniques are also found in natural 
language computing and machine learning. 

Position Frequency Matrix (PFM) 

In practice, a count matrix is usually converted to PFM, and thus a zero-order 
PWM for scanning a long sequence. The dimension and layout of count matrix 
is exactly the same as those of the corresponding PFM and zero-order PWM. 
Their main difference is the element type. For count matrix, each element is 
simply a count. For PFM, each element is a Maximum Likelihood Estimate 
(MLE) parameter. For zero-order PWM, each element is a weight. 

To derive a PFM F from a count matrix, C , maximum likelihood estima¬ 
tion (MLE) is used 1116) . Mathematically, we aim at maximizing the likelihood 
function L(C) = P(C\F ) = JlHi TlL _i F^ 1 ' . In addition, we impose a param¬ 
eter normalization constraint Xq=i = 1 for each ith position. It is added 
to the likelihood function with a Lagrange multiplier Ai, resulting in a new 
log likelihood function: 


w 4 w 4 

InL'(C) = EE Cjilog(Fji) -f- E^E^-d 

i =1 j =1 i—1 j =1 

By taking its partial derivatives to zero, it has been shown that F rj = 4 3 ' . 

Ej = 1 '-'ji 

The MLE parameter definition is quite intuitive. It is simply the occurring 
fraction of a nucleotide at the same position. For example, given the previous 
SOX9 count matrix C sox g, we can convert it to a PFM F sox g as follows: 
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F SO x9 — 



1 

2 

3 

4 

5 

6 

7 

8 

9 

A 

/ 0.31 

0.68 

0.75 

0.01 

0.83 

0.89 

0.06 

0.31 

0.13 \ 

C 

0.10 

0.09 

0.06 

0.91 

0.06 

0.04 

0.01 

0.09 

0.13 

G 

0.40 

0.10 

0.01 

0.04 

0.01 

0.03 

0.03 

0.49 

0.69 

T 

^ 0.19 

0.13 

0.18 

0.04 

0.10 

0.04 

0.90 

0.11 

0.05 / 


where a pseudocount = 1 is added to each element of C sox g. 

We can observe that the most invariant positions of the SOX9 motif are 
the 4th and 6 th position. At the 4th position, cytosines have been found most 
of the times while guanines and thymines have just been found few times. 

Position Weight Matrix (PWM) 


To scan a long sequence for motif matches using a PFM F, we need to derive a 
PWM M first so that the background distribution can be taken into account. 
As aforementioned, each element of PWM is a weight. Each weight can be 
viewed as a preference score. In practice, it is usually defined as the log likeli¬ 
hood ratio between the motif model and background model. Mathematically, 
Mji = log(^ L ) where Bj is the occurring fraction of the jth nucleotide in all 
the background sequences such that, given a subsequence a of the same width 
as the PWM (width w). we can compute a score S(a) by summation only: 


log P(a|F) 

J P{a\Back ground) 

(1) 

tt"' rr 4 fKW] 

, lli=i 

log -7-4-t 

rr"’ rr 4 r[°<=j] 
lli=l llj=l B j 

(2) 

w 4 pK=j] 

^nii( B w 

*=ij=i 

(3) 

w 4 p.. 

i= l j= l n 3 

(4) 

w 4 

i=l j=l 

(5) 


where cq is the numeric index for the nucleotide at ith position of the subse¬ 
quence a. For example, given the previous SOX9 PFM F sox 9 , we can convert 


it to a PWM M sox 9 

1 

as follows: 

2 3 

4 

5 

6 

7 

8 

9 

A 

( 0.32 

1.46 

1.58 

-4.32 

1.72 

1.85 

-2.00 

0.32 

-1.00 

rrr C 1 

' -1.32 

-1.51 

-2.00 

1.87 

-2.00 

-2.74 

-4.32 

-1.51 

-1.00 

m sox q — Q 

1 0.68 

-1.32 

-4.32 

-2.74 

-4.32 

-3.32 

-3.32 

0.96 

1.49 

T 

\ -0.42 

-1.00 

-0.51 

-2.74 

-1.32 

-2.74 

1.85 

-1.15 

-2.32 


where we have assumed that the background distribution is uniform (i.e. Bj = 
0.25 for 1 < j < 4) for illustration purposes. 
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Learning (Motif Discovery) 

In general, motif discovery aims at building motif models (e.g. PFM) from 
related sequences. Nonetheless, there is a variety of motif discovery methods 
in different biological settings. From a computing perspective, they can be 
classified into several paradigms by its input data types: 

1. A Set of Sequences 

2. A Set of Sequences with Quantitative Measurements 

3. A Set of Orthologous Sequences 

Motif Discovery for A Set of Sequences 

The most classical one is de novo motif discovery which just takes a set of 
sequences as the inputs. The set of sequences is extracted such that a common 
transcription factor is believed to bind to them, assuming that motif models 
(e.g. consensus substrings) can be found from those sequences. For example, 
the promoter and enhancer sequences of the genes co-regulated by a common 
transcription factor or the sequence regions around the next generation paral¬ 
lel sequencing peaks called for a common transcription factor. Theoretically, 
Zia and Moses have proved a theoretical upper bound on the p-value which 
at least one motif with a specific information content occur by chance from 
background distribution (false positive) for the one-occurrence per sequence 
motif discovery problem |"l82l . 

Chan et al. applied evolutionary computation techniques to the problem 
EHUMj. Hughes et al. proposed a Gibbs sampling algorithm called Alig- 
nACE, to sample and evaluate different possible motif models using a priori 
log likelihood scores m ■ Workman et al. have proposed a machine learn¬ 
ing approach using artificial neural networks (ANN-Spec) (1*72 . Hertz et al. 
utilized the maximal information content principle to greedily search for a 
set of candidate sequences for building motif models (Consensus) [72]. Frith 
et al. have adopted simulated annealing approaches to perform multiple lo¬ 
cal alignment for motif model building (GLAM) [50]. Ao et al. have used 
expectation maximization to determine DNA motif position weight matrices 
(Improbizer) [5J. Bailey et al. have proposed MEME to optimize the expected 
value of a statistic related to the information content of motif models |9|. A 
parameter enumeration pipeline wrapping MEME (MUSI) was proposed for 
elucidating multiple specificity binding modes [88] . Eskin et al. employed a 
tree data structure to find composite weak motifs (MITRA) [53] . Thijs et al. 
have further improved the classic Gibbs sampling method and called it Mo- 
tifSampler 154 . Van Helden et al. have proposed a counting algorithm to 
detect statistically significant motifs [55] , Regnier and Denise have proposed 
an exhaustive search algorithm (QuickScore) [ 128] , Favorov et al. have uti¬ 
lized Markov Chain Monte Carlo to solve the problem in a Bayesian manner 
(SeSiMCMC) [H] . Pavesi et al. have proposed an exhaustively enumerated 
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and consensus-based method (Weeder) [121] . Another exhaustive search al¬ 
gorithm to optimize z-scores (YMF) has been proposed by Sinha and Tompa 

m- 

Although different statistical techniques have been developed for the motif 
discovery problem, most of the existing methods aim at building motif models 
in the form of either a set of strings or a zero-order PWM. Nonetheless, it 
is well known that nucleotide dependencies and indel operations exist within 
some TFBSs (a.k.a. motifs) [1551 (551 . It is desirable to develop new methods 
which can capture and model such information. 

Motif Discovery for A Set of Sequences with Quantitative Affinity 
Measurements 

It has been pointed out that a fundamental bottleneck in TFBS identification 
is the lack of quantitative binding affinity data for a large portion of tran¬ 
scription factors. The advancement of new high-throughput technologies such 
as ChIP-Chip, ChIP-Seq, and Protein Binding Microarray (PBM) has made 
it possible to determine the binding affinity of these TFs (i.e. each sequence 
can be associated with a binding intensity value) |8J. In particular, the PBM 
technology can enable us to enumerate all the possible k-mers, providing an 
unprecedentedly high resolution binding site affinity landscape for each TF. 
In light of this deluge of quantitative affinity data, robust probabilistic meth¬ 
ods were developed to take into account those quantitative affinity data. Seed 
and Wobble has been proposed as a seed-based approach using rank statistics 
[15] . RankMotif-t—|- was proposed to maximize the log likelihood of their prob¬ 
abilistic model of binding preferences [25J. MatrixREDUCE was proposed to 
perform forward variable selections to minimize the sum of squared devia¬ 
tions m■ MDScan was proposed to combine two search strategies together, 
namely word enumeration and position-specific weight matrix updating [104] . 
PREGO was proposed to maximize the Spearman rank correlation between 
the predicted and the actual binding intensities [152] . Notably, Wong et al. 
have proposed and developed a hidden Markov model approach to learn the 
dependence between adjacent nucleotide positions rigorously; they also show 
that their method (kmerHMM) can deduce multiple binding modes for a given 
TF fTHfij . 

Note that this paradigm is a generalization from the motif discovery for a 
set of sequences with binary measurements. For example, SeedSearch im and 
DME [145; . In other words, it also includes the discriminative motif discovery 
method in which a set of motif-containing sequences and a set of background 
sequences are given as the input since we can assign a value of 1 to each 
motif-containing sequence and 0 to each background sequence. 

Motif Discovery for A Set of Orthologous Sequences 

It is generally acknowledged that by comparing evolutionarily related DNA or 
protein sequences, functionally important sequences or motifs can be revealed 
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by such comparison. A functional motif is assumed to be more conserved 
across different species than the background sequences [55]. By incorporating 
the evolutionary conservation with sequence-specific DNA binding affinity, 
different methods have been proposed. Moses et al. have proposed an exten¬ 
sion of MEME to take the sequence evolution into account probabilistically 
m- Kellis et al. have proposed a spaced hexamer enumeration approach to 
identify conserved motifs [84] . FootPrinter has been proposed as a substring- 
parsimony based approach using dynamic programming to find statistically 
enriched motif substrings J2U] ■ A Gibbs sampling approach named PhyloGibbs 
has also been proposed Mi- 

Prediction (Motif Search) 

After a motif model has been found, it is always desirable to apply it to 
search for motif instances over a given sequence (e.g. ChIP-Seq peak se¬ 
quences). Some basic search methods have been developed to search motif 
instances over a sequence. Nonetheless, those methods do not have sufficient 
motif model complexity to distinguish false positives from true positives over 
a long sequence (e.g. 100k bp) [162j . To cope with that, some improvements 
have been made. In general, most of them utilize the biological information 
beyond the motif sequence specificity to augment the motif model complex¬ 
ity insufficiency. In particular, multiple motif information and evolutionary 
conservation have been readily adopted to improve the discovery accuracy 



Basic Searches 
Likelihood Ratio 

Given a sequence & 16263 ... 6 / of length l and a PWM M of the motif x of 
width w, we scan 6 i& 2 & 3 ---fr; with a window of width w such that the subse¬ 
quences which likelihood ratio score is higher than a pre-specified threshold 
are considered the instances (hits) of the motif x. Mathematically, a subse¬ 
quence bk+ibk+ 2 ---bk+w is considered as a motif instance (hit) if and only if 
the following condition is satisfied: 

w 4 

S x (b k +i bk-\-2’”bk-\-w') EE I(bk+i = > threshold 

*=1 j =1 

where rij is the jth nucleotide among {A,C,G,T} and /(...) is the Iverson 
bracket. Nonetheless, different motifs may have different likelihood score dis¬ 
tributions. It is difficult to set a single and fixed threshold which can work 
for all the motifs. To solve the problem, one can normalize the score to the 
interval [ 0 , 1 ] based on the maximal and minimal scores as follows: 
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S (b k -\-l ^fc+2 ■ • -^fc+u’) 


S x {b k b k+l b k+2 -.b n ) - min S x (seq) 

seq 

max S x (seq) — min S x (seq) 

seq seq 


Posterior Ratio 


If we know the prior probability of motif occurrence 7 r, it can also be incorpo¬ 
rated into the scoring function in a posterior manner insj. Mathematically, 
given a sequence a, motif model (including PFM F and PWM M), and back¬ 
ground distribution B, we can compute the posterior ratio as follows: 


S"(a) 


P(F\a) 

P(B\a) 

P(a\F)P(F) 

PM 

F(a\B)P(B) 

P(a) 

P(a\F)P(F) 

P(a\B)P(B) 

ii-i 11: 

nr=iii-=i b^(i-tt) 

w 4 77, 

^nn(f) K - ] (^) 


log 


log 


log 


log 


i=ij=i 


S(a) + log (——) 
1 — 7 r 


( 6 ) 

(7) 

( 8 ) 
(9) 

( 10 ) 

( 11 ) 


It can be observed that the posterior ratio can be computed from the likelihood 
ratio by simply adding the logarithm of the prior probability ratio. 


P-value 


Given the previous scoring functions, it is not easy to set a threshold since 
they are just ratios. For example, if S(a) > 0 in the above example, it just 
means the likelihood that the sequence a is generated by the motif model 
is higher than the background and vice versa. To justify it in a meaningful 
way, P-value distribution can be calculated from a motif model. Given a motif 
PWM M of width w , an exhaustive search can be applied to traverse all the 
possible sequences of width w. Nonetheless, it takes 4 W time complexity for 
the DNA alphabet { A , G, G, T}. Interestingly, if the PWM M is of zero-order, 
we can exploit the column independence assumption and apply dynamic pro¬ 
gramming to calculate the exact P-value distribution in 4w time complexity 
|149| . In practice, the empirical P-value distribution may also be used. 

Nonetheless, the specificity of a PWM of width w is still not high if it 
is applied to a very long sequence of length L. Mathematically, even if we 
just assign the best match as the hit, L ~^ +1 hits are still expected (e.g. If 
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L = 10000 and w = 6, 2.44 hits are expected), assuming that the sequence is 
uniform in background nucleotide distribution. To solve the problem, people 
have spent efforts on incorporating more biological information to improve 
the motif search. 

Incorporating Multiple Motif Information 

To improve motif search, multiple motif information can be incorporated. 
Multiple motif sites are usually clustered together, resulting in higher signal- 
to-noise ratios which can be easier to be detected than alone. If multiple sites 
of the same motif are clustered together within a short distance, it is called 
homotypic clustering [11)2) . On the other hand, if multiple sites of different 
motifs are clustered together within a short distance, it is called heterotypic 
clustering. 

To exploit the additional clustering signals beyond sequence specificity, 
MAST was proposed to multiply the P-values of multiple motif matches (hits) 
together, which has demonstrated superior performance in sequence homol¬ 
ogy search than the other two methods proposed in the same study 1.10). 
CIS-ANALYST was proposed as a sliding window approach to predict the 
windows which have at least minsites motif matches (hits) with pvalues 
< sitejp m- Sinha et al. have proposed a probabilistic model, Stubb, to 
efficiently detect clusters of binding sites (i.e. cis-regulatory modules) over 
genomic scales using maximum likelihood estimation m- To determine the 
window size parameter, a window size adjustment procedure has been used 
in ClusterBuster to find clusters of motif matches m- Segal et al. have also 
derived an expectation maximization algorithm to model the clusters of motif 
matches as probabilistic graphical models [134] . Recently, Hermann et al. have 
proposed an integrative system (i-cisTarget) to combine the high-throughput 
next generation sequencing data with motif matches to provide accurate motif 
cluster search m■ Notably, Zhou and Wong have shown that it is possible to 
search for clusters of motifs in a de novo way (i.e. without any given motif 
model and information) [1'SIJ . 

Incorporating Evolutionary Conservation 

Another approach to improve motif search is to incorporate evolutionary con¬ 
servation. The rationale behind that is similar to that behind phylogenetic 
motif discovery which we have described in a previous section. Deleterious 
mutations will be removed from population by negative selection. To make 
use of that fact, we could imply that a true motif match should be more con¬ 
served across closely related species (For example, chimpanzee and mouse) 
than background sequences [561 . For instance, a windowing approach with sev¬ 
eral thresholds for motif matches and conservation, ConSite, was proposed by 
Sandelin et al. m- Nonetheless, it is limited to pair-wise analysis. rVISTA 
is a similar approach |39j using the Match program for motif matching in 
TRANSFAC [109 j. Bayesian Branch Length Score (BBLS) was proposed as 
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a evolutionary conservation score without relying on any multiple sequence 
alignment [ 175] . A parsimonious method for finding statistically significant 
k-mers with dynamic programming was proposed (FoorPrinter) 121) . Notably, 
Moses et al. proposed a comprehensive probabilistic model to search for motif 
instances with efficient p-value estimation (MONKEY) [115 1. 

To search for novel motif instances, there are programs aimed at search¬ 
ing for motif matches without any given motif model and information. For 
instance, Ovcharenko et al. have used likelihood ratio tests to distinguish con¬ 
served regions from the background [119] . Siepel et al. have demonstrated an 
approach in identifying conserved regions using hidden Markov models called 
PhastCons m- 

Incorporating Both Approaches 

Both motif clustering information and evolutionary conservation were demon¬ 
strated beneficial to motif search. Since they are independent of each other, 
it is straightforward to combine them. Philippakis et al. have proposed a 
method to combine both types of information (i.e. motif clustering and evolu¬ 
tionary conservation), achieving good performance on experimentally verified 
datasets [124] , MONKEY has been extended by Warner et al. to exploit the 
motif clustering information to predict motif clusters (PhylCRM) [l'5'l] . It has 
been reported that the misalignment errors of the input reference sequences 
from other species could affect the quality of phylogenetic footprinting for mo¬ 
tif search. Thus statistical alignments have been used to assist motif search in 
EMMA [68]. Stub has also been extended to StubMS to take multiple species 
conservation into account using a HMM phylogenetic model [ 442) . Notably, a 
unified probabilistic framework which integrates multiple sequence alignment 
with binding site predictions, MORPH, was proposed by the same group [ 1139 ]. 
Its effectiveness has been demonstrated and verified in an independent com¬ 
parison study m- 

2.2 Genome-wide DNA Binding Pattern Discovery 

Chromatin immunoprecipitation (ChIP) followed by high-throughput sequenc¬ 
ing (ChIP-Seq) measures the genome-wide occupancy of transcription factors 
in vivo. In a typical ChIP-Seq study, the first step is to call the peaks, i.e. 
determining the precise location in the genome where the TF binds. A number 
of peak calling tools have been developed; for instance, model-based analysis 
of ChIP-Seq data (MACS) was proposed to model the shift size of ChIP-Seq 
tags and local biases to improve its peak-calling accuracy 11801 . Spp is another 
method with a strong focus on background signal correction [SB]. PeakSeq is 
a two-pass strategy method. The first pass accounts for the sequence mappa- 
bility while the second pass is to filter out statistically insignificant regions 
comparing to controls m ■ CisGenome refines peak boundaries and uses 
a conditional binomial model to identify peak regions [78j . However, recent 
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benchmark studies suggest that their predicted peaks are distinct from each 
other HB31I531. 

Different combinations of DNA-binding protein occupancies may result in 
a gene being expressed in different tissues or at different developmental stages. 
To fully understand a gene’s function, it is essential to develop unsupervised 
learning models on multiple ChIP-Seq profiles to decipher the combinatorial 
regulatory mechanisms by multiple transcription factors. 

Since multiple transcription factors often work in cis regulatory modules 
to confer complex gene regulatory programs, it is necessary to develop models 
on multiple ChIP-Seq datasets to decipher the combinatorial DNA-binding 
mechanism. In the following, we briefly review some of the previous works 
in this area. Gerstein et al. used pair-wise peak overlapping patterns to con¬ 
struct a human regulatory network m- Xie et al. proposed self organizing 
map methods to visualize the co-localization of DNA-binding proteins |174j . 
Giannopoulou et al. proposed a non-negative matrix factorization to eluci¬ 
date the clustering of DNA-binding proteins [55]. Zeng and colleagues pro¬ 
posed jMOSAiCS to discover histone modification patterns across multiple 
ChIP-Seq datasets Il78l . Ferguson et al. have described a hierarchical Bayes 
approach to integrate multiple ChIP-Seq libraries to improve DNA binding 
event predictions. In particular, they have applied the method to histone 
ChIP-Seq libraries and predicted the gene locations associated with the ex¬ 
pected pathways [45] . Mahony et al. also proposed a mixture model (Multi- 
GPS) to detect differential binding enrichment of a DNA-binding protein in 
different cell lines, which can improve the protein’s DNA binding location pre¬ 
dictions (i.e. Cdx2 protein in their study) |108| . On the other hand, Chen et 
al. proposed a statistical framework (MM-ChIP) based on MACS to perform 
an integrative analysis of multiple ChIP datasets to predict ChIP-enriched 
regions with known motifs for a given DNA-binding protein (i.e. ER and 
CTCF proteins in their study) [50]. On the other hand, Ji et al. proposed 
a differential principal component analysis method on ChIP-Seq to perform 
unsupervised pattern discovery and statistical inference to identify differential 
protein-DNA interactions between two biological conditions m- Guo et al. 
described a generative probabilistic model (GEM) for high resolution DNA 
binding site discovery from ChIP data m • Interestingly, that model combines 
ChIP signals and DNA motif discovery together to achieve precise predictions 
of the DNA binding locations of a DNA-binding protein. The authors have 
further demonstrated how GEM can be applied to reveal spatially constrained 
transcription factor binding site pairs on a genome. 

Despite the success of the methods described above, to fully understand 
a gene’s function, it is essential to develop probabilistic models on multi¬ 
ple ChIP-Seq profiles to decipher the genome-wide combinatorial patterns of 
DNA-binding protein occupancy. Unfortunately, the majority of the previous 
work usually focused on large-scale clustering of called peaks, which is an intu¬ 
itive and straightforward approach. However such approaches have two limita¬ 
tions, as (i) peak-calling ignores the contributions from weak bindings of TFs, 
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and (ii) pair-wise analysis ignores the complex combinatorial binding pattern 
among the TFs. Thus an unsupervised learning model called SignalSpider has 
been proposed to directly analyze multiple normalized ChIP-Seq signal pro¬ 
files on all the promoter and enhancer regions quantitatively so that weak 
bindings can be taken into account [ESI EH- Especially, its computational 
complexity has been carefully designed to scale with the increasing ChIP-Seq 
data (i.e. linear complexity). With such a linear complexity, the method (Sig¬ 
nalSpider) has been successfully applied to more than 100 ChIP-Seq profiles 
in an integrated way, revealing different genome-wide DNA-binding modules 
across the entire human genome (hgl9) 11691 . 


3 Unsupervised Learning for inferring microRNA 
regulatory network 

While transcription factors (TFs) are the major transcriptional regulator pro¬ 
teins, microRNA (miRNA), a small ^22 nucleotide noncoding RNA species, 
has been shown to play a crucial role in post-transcriptional and/or transla¬ 
tional regulation [12]. Since the 1993 discovery of the first miRNA let-7 in 
worms, a vast amount of studies have been dedicated to functionally charac¬ 
terizing miRNAs with a special emphasis on their roles in cancer. While TFs 
can serve either as a transcriptional activator or as a repressor, miRNAs are 
primarily known to confer rnRNA degradation and/or translational repres¬ 
sion by forming imperfect base-pair with the target sites primarily at the 3' 
untranslated regions of the messenger RNAs 66]. While miRNAs are typi¬ 
cally ~22 nt long, several experimental studies combined with computational 
methods [SHI E3 HD 25, !j7 15] 1177] have shown that only the first six or seven 
consecutive nucleotides starting at the second nucleotide from the 5' end of 
the miRNA are the most crucial determinants for target site recognition (Fig¬ 
ure]^. Accordingly, the 6mer or 7mer close to the 5' region of the miRNA 
is termed as the “seed” region or seed. miRNAs that share common seeds 
belong to an miRNA family as they potentially target a vastly common set of 
mRNAs. Moreover, the target sites at the 3'UTR Watson-Crick (WC) pairing 
with the miRNA seed are preferentially more conserved within mammalian 
or among all of the vertebrate species [98] , In humans, more than one third 
of the genes harbour sites under selective pressure to maintain their pairing 
to the miRNA seeds [D3E2|. An important variation around this seed-target 
pairing scheme was discovered by Lewis et al. (2005), where the target site 
is flanked by a conserved adenosine ‘A’ facing to the first nucleotide of the 
targeting miRNA m- 

The dynamics of the miRNA regulatory network are implicated in various 
phenotypic changes including embryonic development and many other com¬ 
plex diseases [1471(551 . Although abnormal miRNA expression can sometimes 
be taken as a stronger indicator of carcinoma in clinical samples than aberrant 
mRNA expression [127, 106] . the system level mechanistic effects are usually 
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Fig. 2. Canonical miRNA Watson-Crick base pairing to the 3'UTR of the mRNA 
target site. The most critical region is a 6mer site termed as the “seed” occurs at 
the 2-7 position of the 5' end of the miRNA [98;. Three other variations centring at 
the 6mer seed are also known to be (more) conserved: 7mer-m8 site, a seed match 
+ a Watson-Crick match to miRNA nucleotide 8; 7mer-tlA site, a seed match + 
a downstream A in the 3'UTR; 8mer, a seed match + both m8 and tlA. The site 
efficacy has also been proposed in the order of 8mer > 7mer-m8 > 7mer- A1 > 6mer 
[493[62]. The abbreviations are: ORF, open reading frame; (NNNNN), the additional 
nucleotides to the shortest 19 nt miRNA; [A|N], A or other nucleotides; Poly(A), 
polyadenylated tail. 


unclear. A single miRNA can potentially target ~400 distinct genes, and there 
are thousands of distinct endogenous miRNAs in the human genome. Thus, 
miRNAs are likely involved in virtually all biological processes and pathways 
including carcinogenesis. However, functional characterizing miRNAs hinges 
on the accurate identification of their nrRNA targets, which has been a chal¬ 
lenging problem due to imperfect base-pairing and condition-specific miRNA 
regulatory dynamics. In this section, we discuss the current state-of-art ap¬ 
proaches in referring miRNA or miRNA-mediated transcriptional regulatory 
network. Table [T] summarizes these methods. As we will see, each method is 
established through an effective unsupervised learning model by exploiting the 
static sequence-based information pertinent to the prior knowledge of miRNA 
targeting and/or the dynamic information of miRNA activities implicated by 
the recently available large data compendia, which interrogate genome-wide 
expression profiles of miRNAs and/or mRNAs across various cell conditions. 


Table 1. Unsupervised learning methods reviewed in this chapter 


Method Algorithm Section Ref 

PicTar Hidden Markov Model 

TargetScore Variational Bayesian Mixture Model 

GroupMiR Nonparametric Bayesian with Indian Buffet Process 
SNMNMF Constrained nonnegative matrix factorization 

Mirsynergy Deterministic overlapping neighbourhood expansion 

3.1 

3.2 

3.3 

ED 

m 

m 

3.4 

179 

43 

ESI 


PicTar: Probabilistic identification of combinations of Target sites; GroupMiR: 
Group MiRNA target prediction; SNMNMF: Sparse Network-regularized Multiple 
Nonnegative Matrix Factorization; PIMiM: Protein Interaction-based MicroRNA 
Modules. 
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3.1 PicTar 

PicTar (Probabilistic identification of combinations of Target sites) is one of 
the few models that rigorously considers the combinatorial miRNA regula¬ 
tions on the same target 3'UTR [91]. As an overview, PicTar first pre-filters 
target sites by their conservation across select species. However, the funda¬ 
mental framework of PicTar is based on hidden Markov model (HMM) with 
a maximum likelihood (ML) approach, which is built on the logics of several 
earlier works from Siggia group BUBS ESI ESI . Among these works, Pic¬ 
Tar was most inspired by “Ahab”, an HMM-based program developed (by 
the same group) to predict the combinatorial TF binding sites (TFBS) [126 ]. 
Although PicTar has been successfully applied to three studies on vertebrates 
EU (where the original methodology paper was described), fly [53], and worm 
[93] (where some improvements were described), the description of the core 
HMM algorithm of PicTar is rather brief. Here we will lay out the detailed 
technicality of the algorithm based on the information collected from several 
related works [101 HOI 1143111261 ITS] , which will help highlight its strengths, 
limitations, and possible future extensions. 

Let A be a 3'UTR sequence, L the length of S , and w € {1,..., K} the 
target sites for miRNA “word” w of length l w , p w the transition probabil¬ 
ity of the occurrence of miRNA w, and pb the transition probability for the 
background of length 4 = 1, which is simply estimated from the fraction of 
A, U, G, C (i.e., Markov model of order 0) either in S with length >300 
nt or from all query UTRs. To simplify notation, the background letters are 
treated as a special word wq so that pb = p Wo and lb = l WQ = 1. Thus, S can 
be represented by multiple different ways of concatenating the segments cor¬ 
responding to either miRNA target sites or background. The goal is to obtain 
at any arbitrary nucleotide position i of the 3'UTR sequence S the posterior 
probability p(ni = w\S,9) that i is the last position of the word, where 6 is 
the model parameters controlling emission probabilities (see below). 

Following Markov’s assumption, p(jti = io|S, 9) is proportional to the prod¬ 
ucts of the probabilities before and after position i. which can be computed in 
time OIL x K) by Forward-Backward algorithm as described below. Formally, 



( 12 ) 


p(s 1 , .. . ,Sj,7T* = w\9)p(s i+1 , . . . ,S l|si, . . 


■,Si,TTi = W, 6) 


p{S\6) 


p{si, ■ ■ ■ ,Si,TTi = w\9)p(si +1 ,...,s L \ni = w, 9) 


P(S\&) 

Z( 1, i, 7 Ti = w)Z(i + 1, . . . , L\TTi = w) 


(13) 

(14) 


p{S\9) 


(15) 
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where p(S\9) is the likelihood of sequence S or the objective function to be 
maximized in the ML framework, Z(l, i, 7 r 2 ; = w ) and Z{i + 1,..., L | 7 Tj = w) 
can be represented in recursion forms and computed via forward and backward 
algorithm, respectively in time 0(L x K) for K words. Formally, the forward 
algorithm is derived as follows: 


Z(l,i,TTi = w) 

= p{s 1 ,...,Si,m=w) (16) 

= p(si,..., Si\iri = w)p(m = w) (17) 

= p(s i -i w+ 1 ,...,s i \TTi =w)p(si,...,Si-i w \ni =w)p('K i =w) (18) 

= p(Si-l w+ 1 , . . .,Si\TTi = w)p(si, . . .,Si-l w ,TTi = w) (19) 


= p{si-i w +i, ■ ■ ■, Si\iTi = w) p(si ,..., Sj-i w , TTi-i w =w')p(-iTi = n;|7r,;_ iro =u>') 

w' 

(20) 

= e(i-l w + 1,..., i\w) ^2 Z( 1, i - l w , m-i w = w')p w (21) 

w' 

where e(i — l w + 1 ,... ,i\w) is the emission probability assumed known (see 
below) and p. w = p(ni = w\ni-i w = w’) is the transition probability to word w. 
Note that p w is position independent. To start the recursion, Z(l,i < 1) = 1. 
The backward algorithm is similarly derived: 

Z(i + 1, . . . , L|7T,; = w) 

= p(si +1 ,...,s L \TTi = w) (22) 

= y 22p(s i+ i,..., s L ,-K i+1 =w'\n i =w) (23) 

w' 

= '22p(s l+1 ,...,S L \TT i+1 =w')p(TT i+1 =w'\TTi = w) (24) 

w' 

= '22p{Si+ 2, ■ ■ ■ ,S L \^i+l =w')p(Si + 2 -l wl ,,..,S i+ i\w i+ i =w')p(TT i+ 1 =V)'\wi = 
w' 

/ 

= Z(i + 2,..., L\n i+1 = w')e(i + 2 - l w ,,..., i + 11 w')p w > (25) 

W 


To start the backward recursion, Z(L — l w + 1, L) = po(w ), which is simply 
the frequency of word w in the 3'UTR sequence S. 

Given the emission probabilities, p w ’s (transition probability) are the only 
parameters that need to be set in order to maximize p(S\9). Following the ML 
solution, 


E l P( 7T i = w\S,0) 
E^'E iPim = w'\S,6) 


(26) 


where the posterior p(jri = w\S,9) is calculated by Eq (15), which is in turn 
computed by forward-backward algorithm. Finally, the likelihood (objective) 
function is evaluated by a simple forward pass to the end of the sequence: 
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Algorithm 1 Baum-Welch HMM algorithm in PicTar m 


Initialize Z(l,i < 1) = 1 and Z(L — l w + 1, L) = po(w) 

E-step: 


Forward recursion (i = 1,... , L): compute Z( 1, i, m = w) by Eq (211 
Backward recursion (i = L — l w ,..., 1): compute Z(i + 1,..., L\-iTi = w) by Eq 


(251 

M-step: 

Update p w by Eq (261 

Evaluate likelihood: 

Compute p(S\9) by Eq (281 

Repeat EM steps until p(S\9) increases by less than a cutoff 


P{S\0) ='^2p(si,...,s l ,tt l =w’\9) (27) 

w' 

= Y j Z^L,k l =w') (28) 

w' 

Together, the optimization of p w in PicTar is performed using Baum-Welch 
algorithm for Expectation-Maximization (EM) as summarized in Algorithm 
[I] Finally, the PicTar score is defined as a log ratio (F = — log Z) of the ML 
over background likelihood Fb- 

PicTarScore = Fb — F (29) 

where Fb is the likelihood when only background is considered. 

As previously mentioned, the emission probabilities e(s|u>) in PicTar are 
assumed known. In Ahab for modelling TFBS, e(s|ui) or m(s\w) is based on 
the position frequency matrix (PFM): m(s|w) = IIj=i fj{ n \ w )i where fj(n\w) 
is the normalized frequency of the nucleotide n at the j-th position of the 
PFM. However, miRNAs do not have PFM. The original PicTar arbitrarily 
sets e(s|w;) to be 0.8 if there is perfect seed-match at 1-7 or 2-8 nt positions 
of the miRNA 5' end AND the free binding energy as estimated by RNAhy- 
brid is no less than 33% of the optimal binding energy of the entire miRNA 
sequence to the UTR 11051 : otherwise e(s|w) is set to 0.2 divided by M for M 
imperfect seed matches with only 1 mismatch allowed, provided it is above 
66 % of the optimal binding energy. Thus, the setting highly disfavours imper¬ 
fect seed match. The later version of PicTar changes the emission probability 
calculation to be the total number of occurrences in conserved 3'UTR sites 
divided by the total number of sites 3'UTR [53]. The setting appears to im¬ 
prove the sensitive/specificity of the model but makes it more dependent on 
the cross-species conservation, potentially prone to false negatives (for the 
non-conserved but functional sites). 

The major advantage of PicTar over other simpler methods is that the co¬ 
ordinate actions of the miRNAs (synergistic in case of optimally spaced sites or 
antagonistic in case of overlapping binding sites) are naturally captured within 
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the emission and transition probabilities. For instance, the PicTarScore as 
the joint ML of multiple miRNA target sites will be higher than the linear sum 
of individual miRNA target sites (i.e., synergistic effects). Longer 3'UTR will 
score less than shorter 3'UTR if both contain the same number of target sites. 
PicTar demonstrated a comparable signahnoise ratio relative to TargetScan 
and was compared favourably with some of the earlier published methods 
based on several surveys j3j [177 , 1135 ', ; 99] . When applied to vertebrates, Pic¬ 
Tar identified roughly 200 genes per miRNA, which is a rather conservative 
estimate compared to the recent findings by TargetScan with a conserved 
targeting scoring approach (35i. When applied to C. elegans (worm), Pic¬ 
Tar identified 10% of the C. elegans genes that are under conserved miRNA 
regulation. 

Nonetheless, PicTar has three important limitations. First, PicTar does 
not consider the correlation between miRNA target sites since p w is essen¬ 
tially position independent. This is perhaps largely due to the increased model 
complexity when considering all pairwise transition probabilities between K 
miRNAs and background since there will be (I\ + 1) x K + 1 parameters 
to model (as apposed to only K + 1). Supported by [52], however, the spe¬ 
cific spatial arrangement of the target sites may be functionally important. In 
particular, the optimal distance between miRNA sites was estimated as 8 to 
~ 40 nt based on transfection followed by regression analysis [55]. Although 
unsupported by experimental evidence, the ordering of some specific target 
sites may be also important. For instance, target site x must be located before 
target site y (for the same or different miRNA) to achieve optimal synergistic 
repression. The model that takes into account the spatial correlation between 
motifs is called the hcHMM (history conscious HMM) implemented in a pro¬ 
gram called Stubb for detecting TF binding sites (TFBS) rather than miRNA 
target predictions Hill- 

Second, the ML approach is prone to local optimal especially for long 
UTRs or many coordinated miRNA actions considered simultaneously (i.e., 
many p w ’s). An alternative HMM formulation is to impose Bayesian priors on 
the HMM parameters (Ml- In particular, m demonstrated such Bayesian 
formalism of HMM in modelling combinatorial TFBS. In their model so-called 
“module HMM”, the transition probabilities is assumed to follow a Dirich- 
let distribution with hyperparameters a = {ao, oq,..., olk+i}, where 
cko corresponds to the background, {oq ... %} to the K TFs, and otx+i to 
the background inside of cis-regulatory module. Inference is performed via 
Markov Chain Monte Carlo (MCMC) procedure or Gibbs sampling in partic¬ 
ular. Briefly, a forward pass and backward pass are run to generate marginal 
probabilities at each nucleotide position. Starting at the end of the sequence, 
hidden states are sampled at each position based on the marginals until reach¬ 
ing the front of the sequence. Given the hidden states, the hyperparameters 
for Dirichlet distribution of the transition probabilities are then updated by 
simply counting the occurrences of each state. The posteriors of the hidden 
states at each nucleotide position are inferred as the averaged number of times 
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the states are sampled in 1000 samplings, and the states with the maximum 
a posteriori (MAP) are chosen. The model is demonstrated to perform better 
than Ahab and Stubb (which are the basis for PicTar) in TFBS predictions 
but have not yet been adapted to miRNA target predictions. 

Third, since data for expression profiling of mRNAs and miRNAs by mi¬ 
croarrays or RNA-seq is now rather abundant (e.g., [7], GSE1738; GSE31568, 
[83]; GSE40499, pH] from GEO) or ENCODE (GSE24565, [36]) or TOGA 
pnrrsg]. the combinatorial regulation needs to be revisited by taking into ac¬ 
count whether or not the co-operative miRNAs are indeed expressed in vivo 
and/or the expression correlation between mRNA and miRNA. In particu¬ 
lar, the emission probabilities e(s|w;) need to be redefined to integrate both 
sequence-based and expression-based information. 

3.2 A probabilistic approach to explore human miRNA target 
repertoire by integrating miRNA-overexpression data and 
sequence information 

One of the most direct way to query the targets of a given miRNA is by 
transfecting the miRNA into a cell and examine the expression changes of 
the cognate target genes |l03j . Presumably, a bona fide target will exhibit de¬ 
creased expression upon the miRNA transfection. In particular, overexpression 
of miRNA coupled with expression profiling of mRNA by either microarray 
or RNA-seq has proved to be a promising approach [Ml Eg. Consequently, 
genome-wide comparison of differential gene expression holds a new promise to 
elucidate the global impact of a specific miRNA regulation without solely re¬ 
lying on evolutionary conservation. However, miRNA transfection is prone to 
off-target effects. For instance, overexpressing a miRNA may not only repress 
the expression of its direct targets but also cause a cascading repression of in¬ 
direct targets of the affected transcription activators. To improve prediction 
accuracy for direct miRNA targets, this chapter describes a novel model called 
TargetScore that integrates expression change due to miRNA overexpression 
and sequence information such as context score [H; [51] and other orthogonal 
sequence-based features such as conservation [J5] into a probabilistic score. 

In one of our recent papers, we described a novel probabilistic method 
for miRNA target prediction problem by integrating miRNA-overexpression 
data and sequence-based scores from other prediction methods [59]. Briefly, 
each score feature is considered as an independent observed variable, which 
is the input to a Variational Bayesian-Gaussian Mixture Model (VB-GMM). 
We chose a Bayesian over a maximum likelihood approach to avoid overfit¬ 
ting. Specifically, given expression fold-change (due to miRNA transfection), 
we use a three-component VB-GMM to infer down-regulated targets account¬ 
ing for genes with little or positive fold-change (due to off-target effects [85]). 
Otherwise, two-component VB-GMM is applied to unsigned sequence scores. 
The parameters for the VB-GMM are optimized using Variational Bayesian 
Expectation-Maximization (VB-EM) algorithm. The mixture component with 
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the largest absolute means of observed negative fold-change or sequence score 
is associated with miRNA targets and denoted as “target component”. The 
other components correspond to the “background component”. It follows that 
inferring miRNA-mRNA interactions is equivalent to inferring the posterior 
distribution of the target component given the observed variables. The tar- 
getScore is computed as the sigmoid-transformed fold-change weighted by the 
averaged posteriors of target components over all of the features. 

Bayesian mixture model 

Assuming there are N genes, we denote x = (aq,..., Xn) t as the log expres¬ 
sion fold-change (x/) or sequence scores (x;, l £ {1,..., L}). Thus, for L sets 
of sequence scores, x £ {x/, xi, ..., x^}. To simplify the following equations, 
we use x to represent one of the independent variables without loss of gen¬ 
erality. To infer target genes for a miRNA given x, we need to obtain the 
posterior distribution p(z|x) of the latent variable z £ {zi ,..., Zk}, where 
A'=3 (K= 2) for modelling signed (unsigned) scores such as logarithmic fold- 
changes (sequence scores). 

We follow the standard Bayesian GMM based on (19] (p474-482) with 
only minor modifications. Although univariate GMM (D = 1) is applied to 
each variable separately, we implemented and describe the following formal¬ 
ism as a more general multivariate GMM, allowing modeling the covariance 
matrices. Briefly, the latent variables z are sampled at probabilities tv (mix¬ 
ing coefficient), that follow a Dirichlet prior Dir(Tv\a.o) with hyperparameters 
cto = ( a o.i) • • ■, ao.if). To account for the relative frequency of targets and 
non-targets for any miRNA, we set the cto,i (associated with the target com¬ 
ponent) to aN and other a 0 ,fc = (1 — a) x N/(K — 1), where a = 0.01 (by 
default). Assuming x follows a Gaussian distribution A/”(x|/x, A” 1 ), where A 
(precision matrix) is the inverse covariance matrix, p(fx, A) together follow 
a Gaussian-Wishart prior J\f(p, k \m Ql ( / 9 0 A) _ 1 )yV , (Afc|Wo, i/ Q ), where the 
hyperparameters {m 0 , /3 0 , W 0 , u 0 } = {/t, 1, Idxd, D + 1 }. 

Variational Bayesian Expectation Maximization 

Let 6 = {z, 7 r,p,A}. The marginal log likelihood can be written in terms 
of lower bound C(q) (first term) and Kullback-Leibler divergence ICC(q\\p) 
(second term): 



(30) 


where q(0) is a proposed distribution for p(6 |x), which does not have a closed 
form distribution. Because lnp(x) is a constant, maximizing C(q) implies min¬ 
imizing K.C(q\\p). The general optimal solution In qj(0j) is the expectation 
of variable j w.r.t other variables, Ej^j[lnp(x, 6)]. In particular, we define 
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q( z, tv, n, A) = q(z)q(iv)q(fj,, A). The expectations for the three terms (at log 
scale), namely In g*(z), Inq* (tv). In g*(/x, A), have the same forms as the ini¬ 
tial distributions due to the conjugacy of the priors. However, they require 
evaluation of the parameters {z,tv ,/x, A}, which in turn all depend on the 
expectations of z or the posterior of interest: 

p(z nk \x n ,6) = E[z nfc ] = £ nk - (31) 

2^j=l Pnj 


where lnp nfc = E[ln 7r fe ] + §E[ln | A fc |] - f ln(27r) - jE^a*, [(x„ - fj, k ) T A k (jc n - 
H k )}. The inter-dependence between the expectations and model parameters 
falls naturally into an EM framework, namely VB-EM. Briefly, we first ini¬ 
tialize the model parameters based on priors and randomly sample K data 
points fi. At the i th iteration, we evaluate (31) using the model parameters 
(VB-E step) and update the model parameters using pT| ) (VB-M step). The 
EM iteration terminates when C(q) improves by less than 10 -20 (default). 
Please refer to m for more details. 


TargetScore 

We define the targetScore as an integrative probabilistic score of a gene being 
a target t (meaning that z nk = 1 for the target component k) of a miRNA: 


targetScore = cr(— log FC ) 


L + 1 


E 


p(t |x) 


(32) 


x£{x/,xi.x L } 


where a (-log FC) = 1+exp( i ogfC ) , p(t |x) is the posterior in ( |3l] ). 


TargetScore demonstrates superior statistical power compared to exist¬ 
ing methods in predicting validated miRNA targets in various human cell¬ 
lines. Moreover, the confidence targets from TargetScore exhibit comparable 
protein downregulation and are more significantly enriched for Gene Ontol¬ 
ogy terms. Using TargetScore, we explored oncomir-oncogenes network and 
predicted several potential cancer-related miRNA-messenger RNA interac¬ 
tions. TargetScore is available at Bioconductor http://www.bioconductor. 
org/packages/devel/bioc/html/TargetScore.html. 


Network-based methods to detect miRNA regulatory modules 

Although targets of individual rniRNAs are significantly enriched for certain 
biological processes |1201ll57] . it is also likely that multiple rniRNAs are coor¬ 
dinated together to synergistically regulate one or more pathways [9T11231IT76 ]. 
Indeed, despite their limited number (2578 mature rniRNAs in human genome, 
miRBase V20, [5D]), rniRNAs may be in charge of more evolutionarily robust 
and potent regulatory effects through coordinated collective actions. The hy¬ 
pothesis of miRNA synergism is also parsimonious or biologically plausible 











Unsupervised Learning in Genome Informatics 


23 


because the number of possible combinations of the 2578 human miRNAs is 
extremely large, enough to potentially react to virtually countless environ¬ 
mental changes. Intuitively, if a group of (miRNA) workers perform similar 
tasks together, then removing a single worker will not be as detrimental as 
assigning each worker a unique task [23]. 

Several related methods have been developed to study miRNA synergism. 
Some early methods were based on pairwise overlaps 136] or score-specific 
correlation [ 176] between predicted target sites of any given two (co-expressed) 
miRNAs. For instance, Shalgi et al. (2007) devised an overlapping scoring 
scheme to account for differential 3'UTR lengths of the miRNA targets, which 
may otherwise bias the results if standard hypergeonretric test was used m- 
Methods beyond pairwise overlaps have also been described. These methods 
considered not only the sequence-based miRNA-target site information but 
also the respective miRNA-mRNA expression correlation (MiMEC) across 
various conditions to detect miRNA regulatory modules (MiRMs). 

For instance, Joung et al. (2007) developed a probabilistic search pro¬ 
cedure to separately sample from the mRNA and miRNA pools candidate 
module members with probabilities proportional to their overall frequency 
of being chosen as the “fittest”, which was determined by their target sites 
and MiMEC relative to the counterparts [52]. The algorithm found only 
the single best MiRM, which varied depending on the initial mRNA and 
miRNA set. Other network-based methods using either the sequence infor¬ 
mation only or using m/miRNA expression profiles only as a filter for a more 
disease-focused network construction on only the differentially expressed (DE) 
m/miRNAs. For instance, Peng et al. (2006) employed an enumeration ap¬ 
proach to search for maximal bi-clique on DE m/miRNAs to discover com¬ 
plete bipartite subgraphs, where every miRNA is connected with every mRNA 
m- The approach operated on unweighted edges only, which required dis¬ 
cretizing miRNA-mRNA expression correlation. Also, maximal bi-clique does 
not necessarily imply functional MiRMs and vice versa. 

The following subsections review in details three recently developed net¬ 
work methods (Table [TJ to detect MiRMs. Despite distinct unsupervised 
learning frameworks, all three methods exploit the widely available paired 
m/miRNA expression profiles to improve upon the accuracy of earlier devel¬ 
oped (sequence-based) network approaches. 

3.3 GroupMiR: inferring miRNA and mRNA group memberships 
with Indian Buffet Process 

The expression-based methods reviewed elsewhere m were essentially de¬ 
signed to explain the expression of each mRNA in isolation using a subset of 
the miRNA expression in a linear model with a fixed set of parameters. How¬ 
ever, the same mRNAs (miRNAs) may interact with different sets of miRNAs 
(rnRNAs) in different pathways. The exact number of pathways is unknown 
and may grow with an increase of size or quality of the training data. Thus, 
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it is more natural to infer the number of common features shared among dif¬ 
ferent groups of miRNAs and mRNAs. Accordingly, Le et al. (2011) proposed 
a powerful alternative model called GroupMiR (Group MiRNA target predic¬ 
tion) [ST . As an overview, GroupMiR first explored the latent binary features 
or memberships possessed within mRNAs, miRNAs, or shared between mR¬ 
NAs and miRNAs on a potentially infinite binary feature space empowered 
by a nonparametric Bayesian (NBP) formalism. Thus, the number of features 
was inferred rather than determined arbitrarily. Importantly, the feature as¬ 
signment took into account the prior information for rniRNA and mRNA 
targeting relationships, obtained from sequence-based target prediction tools 
such as TargetScan or PicTar. Based on the shared memberships, mRNAs 
and/or miRNAs formed groups (or clubs). The same miRNAs (mRNAs) could 
possess multiple memberships and thus belong to multiple groups each cor¬ 
responding to a latent feature. This was also biologically plausible since a 
miRNA (mRNA) may participate in several biological processes. Similar to 
GenMiR+- 1 - ca, GroupMiR then performed a Bayesian linear regression on 
each mRNA expression using all miRNA expression but placing more weight 
on the expression of miRNAs that shared one or more common features with 
that mRNA. 

Specifically, the framework of GroupMiR was based on a recently devel¬ 
oped general nonparametric Bayesian prior called the Indian Buffet Process 
(IBP) [60] (which was later on proved to be equivalent to Beta process jl53] ). 
As the name suggests, IBP can be understood from an analogy of a type of 
an ‘Indian buffet’ as follows. A finite number of N customers or objects form 
a line to enter one after another a buffet comprised of K dishes or features. 
Each customer i samples T 41 dishes selected by mk previous customers, 
and Poisson(j) new dishes, where a is a model parameter. The choices of N 
customers on the K dishes are expressed in aniVx K binary matrix Z. A left- 
order function lof(-) maps a binary matrix Z to a left-ordered binary matrix 
with columns (i.e., dishes) sorted from left to right by decreasing order of nik 
and breaking ties in favour of customers who enter the buffet earlier. This pro¬ 
cess defines an exchangeable distribution on equivalence class [Z] comprising 
all of the Z that have the same left-ordered binary matrix Zo/(Z) regardless 
of the order the customers enter the buffet (i.e., row order) or the dish order 
(i.e., column order). 

Before reviewing the IBP derivation, we need to establish some notations 
[BO]. ( zik ,..., i)fc) (* £ {1,..., N}) denotes the history h of feature k at 
object i, which is encoded by a single decimal number. At object i = 3, for 
instance, a feature k has one of four histories encoded by 0,1, 2,3 correspond¬ 
ing to all of the four possible permutations of choices for objects 1 and 2: 
(0,0), (0,1), (1, 0), (1,1). Accordingly, for N objects, there are 2 N histories 
for each feature k and 2 N — 1 histories excluding the history of all zeros (i.e., 
(0)ixJv)- Additionally, A/, denotes the number of features possessing the same 
history h, K 0 for all features with m,k = 0, and K + = J2h=i f° r f ea_ 
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tures for which m k > 0. Thus, K = I\ 0 + K+- It is easy to see that binary 
matrices belong to an equivalence class if and only if they have the same his¬ 
tory profile h for each feature k. The cardinality (card) of an equivalence class 
[Z] is the number of all of the binary matrices with the same history profile: 


card([ Z]) = 


I< 

K 0 ■ ■ ■ K 2 n -i 


K\ 


rCo ' K h \ 


(33) 


As shown below, Eq (33) is essential in order to establish the close-formed 
solution of IBP prior when K —> oo leads to an infinite feature space or 
infinite number of columns in Z. After establishing the above properties, the 
central steps in deriving the IBP prior used in GroupMiR is reviewed below. I 
will focus on some steps neglected from the original work and refer the reader 
to the full derivation when appropriate. As in the original papers, we first 
derive IBP on a finite number of latent features K and then take the limit 


making use of Eq (33). 


Let Z be an N x K binary matrix, where N = M + R for M mRNAs and R 
rniRNAs, and K is the number of latent features. Assuming the binary value 
Zk in Z for each feature k is sampled from Bernoulli^ k ) and are conditionally 
independent given 7Tfc, the joint distribution of Zk is then: 


Zjb. 

V 


P(z k \-Kk) = JJ(1 - TTfc ) 1 Zik K Z k 

i 

= exp ~ Zik ) 1 °S( 1 _ n k) + Zik log 7r fc ^ 

where TT k follows a Beta prior Tr k \a ~ Beta(r, s) with r = j|, s = 1: 


p(TT k \a) = 


B(r, s) 


S — 1 


B(§, 1 ) 


(34) 


(35) 


where B(-) is a Beta function. To take into account the prior information be¬ 
tween rniRNA and mRNA targeting from sequence-based predictions, Group¬ 
MiR incorporated in Eq (35) an N x N weight matrix W: 

W = 


0 c 

C T 0 


(36) 


where interaction within mRNAs and within rniRNAs were set to zeros and 
interaction between mRNA and miRNA followed the R x M scoring matrix 
C obtained from a quantitative sequence-based predictions. In particular, Mi- 
Randa scores were used in their paper. Thus, Wij is either 0 or defined as a 
pairwise potential of interactions between mRNA i and miRNA j. The mod¬ 
ified p*(-rrk\a) was then defined as: 

( 37 ) 
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where 3> Zk and the partition function Z' were defined as: 


& Zk = exp Ywijz ik z jk 


. i<j 


z’= Y, & h B(^ + m h , N - m h + 1) 


h —0 


( 38 ) 


(39) 


The marginal probability of P(Z) is derived by integrating out 7Tfc as follows: 


K -i 


P(Z) = TT [ P{z k \TTk)P(n k \a)dit k 

Jo 


(40) 


fc=i 


= TT / exp - Zifc) l°g( 1 - TTfc) + Z ik log n k 

k=i Jo v , 


zv# s 


d'Kk 


(41) 


= II / exp ( ~ Zik ) “ ^fc) + -ife logTTfc j exp - l) logTTfc 

fc=l "' 0 Vi / 

(42) 

A ^ i 

= n J exp ^(iV - TOfe)log(l - 7Tfc) + m fe logTTfc + (-^ - 1) logTTfe^) rfTTfc 


fc=l 


if 


(43) 


= n ^ J exp ((iV - m fe )log(l - 7Tfc) + (^ + m k - 1) log 7T fc ^ dir k 

(44) 


fc=i 


if 


= n +mk ’ N ~ mk+i ) 


fc=i 


(45) 


where Wfc in Eq (43) is the sum over all Zi k = 1, and (45) directly follows 


the definition of Beta function. However, when lim^-^oo P(Z) = 0 since the 
probability of sampling a specific binary matrix from an infinite number of 
matrices is 0. Instead, the inference was performed over the equivalence class 
[Z] with the number of Zo/-equivalent matrices defined above: 







Unsupervised Learning in Genome Informatics 


27 


P([Z]) = £ P(Z) 
ze[z] 

A! 


(46) 


if 


nr^o 1 ^!^ z 


n +m kl N -m k + 1) 


K. 


J fc p a z l) = ^=r 


nu 1 av 


n#zfc (iV-m fc )!(m fc - 1)! ex p(-^) 


h- k =l 


where 


*= E 


/i=0 


(iV - m k )\[m k - 1)! 
1 TV! 


(47) 

(48) 

(49) 


A more elaborate derivation of (|48j) was described in the Appendix from [94] 
and omitted here. Additionally, the authors also showed that when W = 0 or 


equivalently #/> = 1 for all histories h, then Eq (48) reduces to the original 
IBP introduced in [BOi, which is thus a special case of the weighted IBP in 
GroupMiR. 


Given the IBP prior Eq ( |48[ ), the generative process for Zi k corresponding 
to an existing feature k (where m k > 0) was derived as follows: 


P{zi k — l|Z_ ifc ) — 


P( z ik — 1) Z-i k ) 


P{ z ik — 0; Z_jfc) + P{ Zik — 1, Z-ik) 

&z k ,z ik =i{N - rri-i k - 1 )\{m-ik + 1-1)! 


(50) 


1 * 


t =o(-/V - m-ik)\{m-ik - 1)! + & Zk , Zik =i{N - m- ik - l)!(w_ lfc + 1-1)! 


ex P(Ej^ w ij ' 1 ‘ Zjk)(N - m_ ifc - l)!(m_ ifc + 1-1)! 

ex P(E^i w ij ' 0 ’ z jk)(N - m- ik )\(m-i k - 1)!+ 
ex P(y~! Wij ■ 1 ' Zjk){N - TO_ ifc - l)!(m_ ifc + 1 - 1 )! 




(51) 


(52) 


ex P(Ej^i WijZj k )m- ik 


(N - m-ik) + exp(Ej^j WijZj k )m~i k 


where the subscript —ik (e.g., Z-i k or m—ik) denotes all objects for k except 
for i, Eq (51) arose from cancellations of the common terms in (48), and 


similar for (52). The number of new feature k* (where m k • = 0) are sampled 
from Poisson(j). 

Notably, Z can be expressed as Z = (U T ,V T ) T , where U is a M x K 
binary matrix for mRNA and V is a R x K binary matrix for miRNA. Thus, 
mRNA i and miRNA j are in the same group k if Ui k Vj k = 1. Given U and 
V, the regression model in GroupMiR was defined as: 


' •A/'Ox- Efa + E Sk)Vj,cr 2 I) 


(53) 


k:uikVjk = 1 
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where Xi is the expression of mRNA i. y is the baseline expression for i. 
r.j is the regulatory weight of rniRNA j, is a group-specific coefficient for 
group fc, and yj is the expression of miRNA j. With the Gaussian distribution 
assumption for athe data likelihood then follows as: 

P(X, Y|Z,6») oc exp - Xi) T [x.i - x *)^ (54) 

where 0 = (/x,ct 2 , s,r) and x t = y, - T,j( r j + Y,k-.u ik v jk = l s k)Vj- Th e (conju¬ 
gate) priors over the parameters in 0 were defined and omitted here. 

Finally, the marginal posterior of Z are defined as: 


P(zik |X, Y, Z_ (ifc) ) oc P(X,Y\Z_ {ik) ,z ik )P(z ik \z_ lk ) 


(55) 


where P(X, Y|Z_( ife ), Zik) was obtained by integrating the likelihood (Eq ( [54] )) 
over all parameters in 0 and P(zik\z-n s ) from Eq (50). 

Due to the integral involved above, analytical solution for posterior in ( [55] ) 
is difficult to obtain. Accordingly, the inference in GroupMiR was performed 
via MCMC: 


1. Sample an existing column z^k from Eq (50); 

2. Assuming object i is the last customer in line (i.e., i = N), sample 
Poisson(j^) new columns and sample from its prior (Gamma distri¬ 
bution) for each new column; 

3. Sample the remaining parameters by Gibbs sampler if closed-form poste¬ 
rior of the parameter exists (due to conjugacy) or by Metropolis-Hasting 
using likelihood Eq (541 to determine the acceptance ratio; 


4. Repeat 1-3 until convergence. 


Finally, the posteriors of Z in Eq (55) for all feature column with at least one 


nonzero entry serve as the target prediction of GroupMiR. 


GroupMiR was applied to simulated data generated from Eq (53) with 


K = 5 (i.e., 5 latent features shared among miRNA and mRNA) and increas¬ 
ing noise level (0.1, 0.2, 0.4, 0.8) to the prior scoring matrix C, mimicking the 
high false positive and negative rates from the sequence-based predictors. At 
the low noise levels of 0.1, 0.2, or 0.4, GroupMiR was able to identify exactly 5 
latent features and above 90% accuracy in predicting the correct memberships 
between miRNA and mRNA. At the high noise level of 0.8, on the other hand, 
GroupMiR started to identify > 5 latent features but the accuracy remained 
above 90%, demonstrating its robustness. In contrast, GenMiR+-1- had much 
a lower performance than GroupMiR on the same data, scoring lower than 
60% accuracy at the high noise level. GroupMiR was also applied to the real 
microarray data with 7 time points profiling the expression of miRNA and 
mRNA in mouse lung development. However, only the top 10% of the genes 
with highest variance were chosen leading to 219 miRNAs and 1498 mRNAs. 
Although the authors did not justify using such a small subset of the data, it 
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is likely due to the model complexity that prohibited the full exploration of 
the data. Nonetheless, GroupMiR identified higher network connectivity and 
higher GO enrichment for the predicted targets than GenMiR++ on the same 
dataset. It would have been even more convincing, however, if GroupMiR was 
also tested on the same datasets of 88 human tissues, which were used in the 
GenMiR++ study [75j. 

Although the time complexity was not analyzed for GroupMiR, it appears 
similar to, if not higher, than the general IBP, which has a time complexity 
of 0(N 3 ) per iteration for N objects [55]. The slow mixing rate is the main 
issue for IBP-based framework due to the intensive Gibbs samplings required 
to perform the inference, which prevented GroupMiR from fully exploring the 
data space at a genome scale offered by microarray or RNA-seq platforms, 
and consequently compromised the model accuracy. Adaptations of efficient 
inference algorithms that were recently developed for IBP are crucial to un¬ 
leash the full power of the NPB framework [5B]. Additionally, GroupMiR did 
not consider the spatial relationships between adjacent target sites of the 
same mRNA 3'UTR for the same or different miRNAs as in PicTar (Section 


3.11. A more biologically meaningful (IBP) prior may improve the accuracy 


and/or the model efficiency by restricting the possible connections in Z. Fi¬ 
nally, it would be interesting to further examine the biological revelation of 
the groupings from Z on various expression consortia. In particular, miRNAs 
participating in many groups or having higher out-degrees in network con¬ 
text are likely to be more functionally important than others. Moreover, the 
groupings may not only reveal rniRNA and mRNA targeting relationships but 
also the regulatory roles of mRNAs as TFs on miRNA when a modified IBP 
prior is used. Taken together, many directions remained unexplored with the 
powerful NBP framework. 


3.4 SNMNMF: sparse network-regularized multiple nonnegative 
matrix factorization 

The nonnegative matrix factorization (NMF) algorithm was originally de¬ 
veloped to extract latent features from images [SBj I160j . NMF serves as 
an attractive alternative to conventional dimensionality reduction techniques 
such as Principle Component Analysis (PCA) because it factorizes the orig¬ 
inal matrix V sxn (for s images and n pixels) into two non-negative matri¬ 
ces V sxn = W aXk H kXn , where W sxk and H kxn are the “image encoding” 
and “basis image” matrices, respectiveljf/] Notably, k needs to be known be¬ 
forehand. The non-negativity of the two factorized matrices enforced by the 
NMF algorithm provides the ground for intuitive interpretations of the latent 
features because the factorized matrices tend to be sparse and reflective to 


4 In the original paper m , the image matrix V sx „ was transposed, where the 
rows and columns represent the pixel and image, respectively. The representation 
used here is to be consistent with the one used by ESI reviewed below. 
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certain distinct local features of the original image matrix. Kim and Tidor 
(2003) were among the very first groups that introduced NMF into the world 
of computational biology m- In particular, the authors used NMF to assign 
memberships to genes based on the “image encoding” matrix H^xn in order 
to decipher yeast expression network using gene expression data measured by 
microarray. Since then, many NMF-based frameworks were developed [34]. 

In particular, Zhang et al. (2011) extended the NMF algorithm to de¬ 
tecting miRNA regulatory modules (MiRMs) [ 1791 . Specifically, the authors 
proposed a sparse network-regularized multiple NMF (SNMNMF) technique 
to minimize the following objective function: 

W, , H 2 <- argmin V ||X Z - WHi \\ 2 - Ai Tr{H 2 AH%) - X 2 Tr(H 1 BH J) 
w,HuH 2l ± 

+ 7ill^ll 2 + 72(Ell^H 2 + E Ill'll 2 ) ( 56 ) 

3 o' 


where 


• X-[ and A 2 are the s x n mRNA and s x m miRNA expression matrices, 
respectively, for s samples, n mRNAs, and m miRNAs; 

• W is the s x k encoding matrix using k = 50 latent features (chosen 
based on the number of spatially separable miRNA clusters in the human 
genome); 

• Hi and H 2 are the k x n and k x m “image basis” matrices for genes and 
miRNAs, respectively; 

• A is the n x n binary gene-gene interactions matrices as a union of the 
transcription factor binding sites (TFBS) from TRANSFAC 1155] and the 
protein-protein interactions from [24] : 

• B is the n x to binary miRNA-mRNA interaction matrix obtained from 
MicroCosm [61] database that hosts the target predictions from MiRanda 

m- 

• 7il|W|| 2 +7 2 (E J IN| 2 + Ey Ill'll 2 are regularization terms that prevent 
the parameter estimates from growing too large; 

• the weights A-| ; 2 and regularization parameters 71,2 were selected post hoc. 


The original optimization algorithm of NMF was based on a simple gradient 
decent procedure, which operated on only a single input matrix. Here, how¬ 
ever, the partial derivative of Eq (56) with respect to each matrix (IT, H 1, H 2 ) 
depends on the optimal solution from the other two matrices. Accordingly, the 
authors developed a two-stage heuristic approach, which nonetheless guar¬ 
antees to converge to a local optimal: (1) update W fixing H\,H 2 (which 
are initialized randomly); (2) update Hi,H 2 fixing W, repeat 1 & 2 until 
convergence. To cluster m/miRNAs, the authors exploited the encoding ma¬ 
trices Hi (k x n) and H 2 (k x to) by transforming each entry into z-score: 
Zij = (hij — h, j )/a j , where hj and ay are the mean and standard deviation 
of column j of Hi (or H 2 ) for mRNA j (or miRNA j'), respectively. Thus, 
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each m/miRNA was assigned to zero or more features if their corresponding 
z-score is above a threshold. 

SNMNMF was applied to ovarian cancer dataset containing paired m/miRNA 
expression profiles from TCGA measuring 559 miRNAs and 12456 genes for 
each of the 385 patient samples. The authors found that more than half of 
the 49 modules (1 module was empty) identified by SNMNMF were enriched 
for at least one GO terms or KEGG pathway. Also, miRNAs involved in 
the SNMNMF-MiRMs were enriched for cancer-related miRNAs. Moreover, 
Kaplan-Meier survival analysis revealed that some of the k latent features 
from the basis matrix W SX k offerred promising prognostic power. Finally, 
SNMNMF compared favourably with the enumeration of bi-clique (EBC) al¬ 
gorithm proposed by Peng et al. (2009) in terms of the number of miRNAs 
involved in the modules and GO/pathway enrichments [122) . Specifically, the 
authors found that EBC tended to produce modules involving only a single 
miRNA and multiple mRNAs. The star-shape modules are instances of a triv¬ 
ial case that can be derived directly from miRNA-mRNA interaction scores 
rather than network analysis. 

Despite the statistical rigor, there are several limitations of the SNMNMF 
algorithm. First, the NMF approach requires a predefined number of modules 
in order to perform the matrix factorization, which may be data-dependent 
and difficult to determine beforehand. Additionally, the NMF solution is of¬ 
ten not unique, and the identified modules do not necessarily include both 
miRNAs and mRNAs, which makes reproducing and interpreting the results 
difficult. Moreover, the SNMNMF does not enforce negative MMEC (miRNA- 
mRNA expression correlation), whereas the negative MMEC is necessary to 
ascertain the repressive function of the miRNAs on the mRNAs within the 
MiRMs. Finally, SNMNMF incurs a high time complexity of 0(tk(s+m+n) 2 ) 
for t iterations, k modules, s samples, m miRNAs, and n mRNAs. Because n 
is usually large (e.g., 12456 genes in the ovarian cancer dataset), the compu¬ 
tation is expensive even for a small number of iterations or modules. Thus, 
an intuitively simple and efficient deterministic framework may serve as an 
attractive alternative, which we describe next. 

3.5 Mirsynergy: detecting synergistic miRNA regulatory modules 
by overlapping neighbourhood expansion 

In one of our recent works, we described a novel model called Mirsynergy 
that integrates m/miRNA expression profiles, target site information, and 
gene-gene interactions (GGI) to form MiRMs, where an m/miRNA may par¬ 
ticipate in multiple MiRMs, and the module number is systematically deter¬ 
mined given the predefined model parameters |100j . The clustering algorithm 
of Mirsynergy adapts from ClusterONE m, which was intended to identify 
protein complex from PPI data. The ultimate goal here however is to construct 
apriori the MiRMs and exploit them to better explain clinical outcomes such 
as patient survival rate. 
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We formulate the construction of synergistic miRNA regulatory modules 
(MiRMs) as an overlapping clustering problem with two main stages. Prior to 
the two clustering stages, we first inferred miRNA-mRNA interaction weights 
(MMIW) (W) using m/miRNA expression data and target site information. 
At stage 1, we only cluster miRNAs to greedily maximize miRNA-miRNA 
synergy, which is proportional to the correlation between miRNAs in terms 
of their MMIW. At stage 2, we fix the MiRM assignments and greedily add 
(remove) genes to (from) each MiRM to maximize the synergy score, which 
is defined as a function of the MMIW matrix and the gene-gene interaction 
weight (GGIW) matrix (H). 


Two-stage clustering 


Let W denote the expression-based N x M MMIW matrix obtained from 
the coefficients of a linear regression model such as LASSO, determined as 
the best performing target prediction model on our data, where Wi & is the 
scoring weight for miRNA k targeting mRNA i. Similar to the “Meet/Min” 
score defined by [1361 for binary interactions of co-occurring targets of miRNA 
pairs, we define an M x M scoring matrix denoted as S, indicating miRNA- 
miRNA synergistic scores between miRNA j and k (j ^ k ): 


Ej=l Wj,jWj,k 
min E i w i,j,'Ei w i,k\ 


Sj,k 


(57) 


Notably, if W were a binary matrix, Eq [W] became the ratio of number of 
targets shared between miRNA j and k over the minimum number of targets 
possessed by j or k, which is essentially the original “Meet/Min” score. We 
chose such scoring system to strictly reflect the overlapping between the two 
miRNA target repertoires rather than merely correlated trends as usually 
intended by alternative approaches such as Pearson correlation. 

Similar to the cohesiveness defined by uni, we define synergy score s(V/) 
for any given MiRM V c as follows. Let w ln (V c ) denote the total weights of the 
internal edges within the miRNA cluster, w bound (V c ) the total weights of the 
boundary edges connecting the miRNAs within V c to the miRNAs outside V c , 
and a(V c ) the penalty scores for forming cluster V c . The synergy of V c (i.e., 
the objective function) is: 


*(V C ) 


_ w m {V c ) _ 

w in (V c ) + w bound {V c ) + a(V c ) 


(58) 


where a(V c ) reflects our limited knowledge on potential unknown targets of 
the added miRNA as well as the false positive targets within the cluster. Pre¬ 
sumably, these unknown factors will affect our decision on whether miRNA k 
belong to cluster V c . For instance, miRNA may target noncoding RNAs and 
seedless targets, which are the mRNAs with no perfect seed-match [jO]. We 
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considered only mRNA targets with seed-match to minimize the number of 
false positives. By default, we set a(V c ) = 2\V C \, where \V C \ is the cardinal¬ 
ity of V c . Additionally, we define two scoring functions to assess the overlap 
oj(V c , V c >) between V c and V c > for c ^ d and the density d\{V c ) of any given 
V c : 


u(VM 


\v c nv c ,\ 2 

\Vc\\Vc>\ 


di(V c ) 


2w in {V c ) 
m(m — 1) 


(59) 

(60) 


where \V C D V c j is the total number of common elements in V c and V c j , and 
m is the number of miRNAs in V c . 

The general solution for solving an overlapping clustering problems is NP- 
hard m- Thus, we adapt a greedy-based approach |117j . The algorithm can 
be divided into two major steps. In step 1, we select as an initial seed rniRNA 
k with the highest total weights. We then grow an MiRM V t from seed k 
by iteratively including boundary or excluding internal miRNAs to maximize 
the synergy s(V t ) (Eq [58]) until no more node can be added or removed to 
improve s(Vt). We then pick another miRNA that has neither been considered 
as seed nor included in any previously expanded V t to form V t +\. The entire 
process terminates when all of the miRNAs are considered. In step 2, we 
treat the clusters as a graph with V c as nodes and oj(V c , V c >) > r as edges. 
Here r is a free parameter. Empirically, we observed that most MiRMs are 
quite distinct from one another in terms of uj(V c ,V c i) (before the merging). 
Accordingly, we set r to 0.8 to ensure merging only very similar MiRMs, which 
avoids producing very large MiRMs (when r is too small). We then perform a 
breath-first search to find all of the weakly connected components (CC), each 
containing clusters that can reach directly/indirectly to one another within 
the CC. We merge all of the clusters in the same CC and update the synergy 
score accordingly. 

After forming MiRMs at stage 1, we perform a similar clustering procedure 
by adding (removing) only the mRNAs to (from) each MiRM. Different from 
stage 1, however, we grow each existing MiRM separately with no prioritized 
seed selection or cluster merging, which allows us to implement a parallel 
computation by taking advantage of the multicore processors in the mod¬ 
ern computers. In growing/contracting each MiRM, we maximize the same 
synergy function (Eq 58 1 but changing the edge weight matrix from S to a 
(N + M) x (N + M) matrix by combining W (the N x M MMIW matrix) 
and H (the N x N GGIW matrix). Notably, here we assume miRNA-miRNA 
edges to be zero. Additionally, we do not add/remove miRNAs to/from the 
MiRM at each greedy step at this stage. Finally, we define a new density 
function due to the connectivity change at stage 2: 


<h(Vc) = 


w m (V c ) 
i(m. + n — 1) 


(61) 
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where n ( m ) are the number of mRNAs (miRNAs) in the V c . By default, 
we filter out MiRMs with d\[Vi) < le-2 and dviYj) < 5e-3 at stage 1 and 
2, respectively. Both density thresholds were chosen based on our empirical 
analyses. For some datasets, in particular, we found that our greedy approach 
tends to produce a very large cluster involving several hundred miRNAs or 
several thousand mRNAs at Stage 1 or 2, respectively, which are unlikely to 
be biologically meaningful. Despite the ever increasing synergy (by definition), 
however, the anomaly modules all have very low density scores, which allows 
us to filter them out using the above-chosen thresholds. 

Notably, standard clustering methods such as fc-means or hierarchical clus¬ 
tering are not suitable for constructing MiRMs since these methods assign 
each data point to a unique cluster [1591 . A recently developed greedy-based 
clustering method ClusterONE is more realistic because it allows overlap be¬ 
tween clusters m- However, ClusterONE was developed with physical PPI 
in mind. Mirsynergy extends from ClusterONE to detecting MiRMs. The nov¬ 
elty of our approach resides in a two-stage clustering strategy with each stage 
maximizing a synergy score as a function of either the miRNA-miRNA syner¬ 
gistic co-regulation or miRNA-mRNA/gene-gene interactions. Several meth¬ 
ods have incorporated GGI as PPI and TFBS into predicting MiRMs [1791195] . 
which proved to be a more accurate approach than using miRNA-mRNA 
alone. Comparing with recent methods such as SNMNMF 1TT91 and PIMiM 
[95] . however, an advantage of our deterministic formalism is the automatic 
determination of module number (given the predefined thresholds to merge 
and filter low quality clusters) and efficient computation with the theoretical 
bound reduced from 0(K(T + N + M) 2 ) per iteration to only 0(M(N + M)) 
for N ( M ) mRNA (miRNA) across T samples. Because N is usually much 
larger than M and T, our algorithm runs orders faster. Based on our tests on a 
linux server, Mirsynergy took about 2 hours including the run time for LASSO 
to compute OV (A=12456; M= 559; T= 385), BRCA or THCA (7V=13306; 
M=710; T=331 or 543, respectively), whereas SNMNMF took more than 
a day for each dataset. Using expression data for ovarian, breast, and thy¬ 
roid cancer from TCGA, we compared Mirsynergy with internal controls and 
existing methods including SNMNMF reviewed above. Mirsynergy-MiRMs 
exhibit significantly higher functional enrichment and more coherent miRNA- 
mRNA expression anti-correlation. Based on the Kaplan-Meier survival anal¬ 
ysis, we proposed several prognostically promising MiRMs and envisioned 
their utility in cancer research. Mirsynergy is available as an R/Bioconductor 
package at http://www.bioconductor.org/packages/release/bioc/html/ 
Mirsynergy.html 

The success of our model is likely attributable to its ability to explicitly 
leverage two types of information at each clustering stage: (1) the miRNA- 
miRNA synergism based on the correlation of the inferred miRNA target 
score profiles from MMIW matrix; (2) the combinatorial miRNA regulatory 
effects on existing genetic network, implicated in the combined MMIW and 
GGIW matrices. We also explored other model formulations such as cluster- 
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ing m/miRNAs in a single clustering stage or using different MMIW matrices 
other than the one produced from LASSO, which tends to produce MiRMs 
each containing only one or a few miRNAs or several very large low qual¬ 
ity MiRMs, which were then filtered out by the density threshold in either 
clustering stage. Notably, an MiRM containing only a single rniRNA can be 
directly derived from the MMIW without any clustering approach. More¬ 
over, Mirsynergy considers only neighbour nodes with nonzero edges. Thus, 
our model works the best on a sparse MMIW matrix such as the outputs 
from LASSO, which is the best performing expression-based methods based 
on our comparison with other alternatives. Nonetheless, the performance of 
Mirsynergy is sensitive to the quality of MMIW and GGIW. In this regard, 
other MMIW or GGIW matrices (generated from improved methods) can be 
easily incorporated into Mirsynergy as the function parameters by the users 
of the Bioconductor package (please refer to the package vignette for more 
details). In conclusion, with large amount of m/miRNA expression data be¬ 
coming available, we believe that Mirsynergy will serve as a powerful tool for 
analyzing condition-specific miRNA regulatory networks. 
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